Red Wine by Steven Ko

Univariate Plots Section

Data Variables:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "rating"

Data Structure:

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ rating              : chr  "normal" "normal" "normal" "normal" ...

Data Summary:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality         rating         
##  Min.   : 8.40   Min.   :3.000   Length:1599       
##  1st Qu.: 9.50   1st Qu.:5.000   Class :character  
##  Median :10.20   Median :6.000   Mode  :character  
##  Mean   :10.42   Mean   :5.636                     
##  3rd Qu.:11.10   3rd Qu.:6.000                     
##  Max.   :14.90   Max.   :8.000

Distinct quality values of the red wines:

## [1] 5 6 7 4 8 3

Start with the distribution of each individual variable:

fixed.acidity:

There are some outliers above 15. The distribution is highly concentrated around 8.

volatile.acidity:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 1.3, and two peaks at about 0.4 and 0.6.
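The stat_bin message above is printed when a histogram is drawn without an explicit bin width. As a minimal sketch of fixing the width by hand, the base R example below bins the ten volatile.acidity values shown in the structure output. The variable names and the 0.1 width are assumptions for illustration, not the settings used for the plot in this report.

```r
# First ten volatile.acidity values, copied from the str() output above.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Assumed bin width for illustration; pick one that suits the full data.
bin_width <- 0.1

# Break points from 0 to 1.6 cover the whole observed range (0.12 to 1.58).
breaks <- seq(0, 1.6, by = bin_width)

# plot = FALSE returns the bin counts without drawing; drop it to see the plot.
h <- hist(volatile_acidity, breaks = breaks, plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```

In ggplot2 the same idea is expressed through the binwidth argument that the message refers to.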

citric.acid:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers near the maximum of 1, and many of the red wines have a value of 0.

Check how many wines have citric.acid equal to 0:

## 
## FALSE  TRUE 
##  1467   132
## [1] 0.08255159

About 8% of the red wines (132 of 1599) have a citric.acid value of 0.
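The check above can be sketched in base R as follows. The example uses only the first ten citric.acid values from the structure output, so its share differs from the full-data figure; the variable names are assumptions for illustration.

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Logical vector: TRUE where the value is exactly zero.
is_zero <- citric_acid == 0

# Count FALSE/TRUE, then take the share of TRUE (the mean of a logical vector).
zero_counts <- table(is_zero)
zero_share <- mean(is_zero)

# The same arithmetic on the full-data counts reported above (1467 FALSE, 132 TRUE).
full_share <- 132 / (1467 + 132)

print(zero_counts)
print(zero_share)
print(full_share)
```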

residual.sugar:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 10, and a high concentration around 2.3.

chlorides:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 0.5, and a high concentration around 0.08.

free.sulfur.dioxide:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 60, and most of the values fall between 5 and 20.

total.sulfur.dioxide:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 175, and most of the values fall between 25 and 75.
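The report does not say how the outlier cut-offs were chosen; they appear to be read off the histograms. One common convention, assumed here purely for comparison, is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the total.sulfur.dioxide quartiles from the summary table.

```r
# Quartiles of total.sulfur.dioxide, copied from the summary output above.
q1 <- 22
q3 <- 62

# Interquartile range and the conventional 1.5 * IQR fences.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("IQR:", iqr, "\n")
cat("Lower fence:", lower_fence, "\n")
cat("Upper fence:", upper_fence, "\n")
```

Under this rule the upper fence is 122, which is stricter than the 175 noted above. With the full column available, boxplot.stats(x)$out applies a closely related rule and lists the flagged values directly.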

density:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

There is a central peak at 0.997, and the distribution looks approximately normal.

pH:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There is a central peak at 3.3, and the distribution also looks approximately normal.

sulphates:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There are some outliers above 1.4.

alcohol:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There is a peak around 9.5 followed by a rapid decrease. No red wine has an alcohol value below 8.

quality:

Most of the wines have a quality of 5 or 6.

Univariate Analysis

What is the structure of your dataset?

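There are 1599 observations of 12 features from the original data, plus the row index X and the derived rating variable. One feature (quality) is categorical, although it is stored as an integer; the others are numerical.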

What is/are the main feature(s) of interest in your dataset?

I think the main feature is quality. People care about quality more than the other features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The other features may all be helpful, because they may influence taste. For example, residual.sugar indicates how sweet the wine is, and the acid features relate to the acidic flavour. SO2 is also regarded as an important ingredient in red wine that influences taste.

Did you create any new variables from existing variables in the dataset?

I created the rating variable based on quality. Wines with quality below 5 are rated bad, wines with quality above 7 are rated good, and the others are rated normal.
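A minimal sketch of how such a rating could be derived is shown below. The helper rate_quality and its parameter names are hypothetical, and the thresholds follow the wording above literally (strictly below 5, strictly above 7); the code actually used for this report may treat the boundaries differently.

```r
# Hypothetical helper: thresholds follow the text literally (below 5, above 7).
rate_quality <- function(quality, bad_below = 5, good_above = 7) {
  ifelse(quality < bad_below, "bad",
         ifelse(quality > good_above, "good", "normal"))
}

# The distinct quality values listed earlier in this report.
quality <- c(5, 6, 7, 4, 8, 3)
rating <- rate_quality(quality)
print(data.frame(quality, rating))
```

Keeping the thresholds as arguments makes it easy to try an inclusive cut-off, for example good_above = 6.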

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

citric.acid has a lot of observations with value 0, which is really unexpected.

Bivariate Plots Section

To quickly get information on each pair of variables, use ggpairs:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

At first glance, quality seems to correlate most strongly with alcohol, volatile.acidity, sulphates, and citric.acid.

Alcohol has a strong negative correlation with density: the more alcohol in the wine, the lower its density.

Sulphates and chlorides are strongly correlated, and so are sulphates and citric.acid.

pH is strongly correlated with fixed.acidity, citric.acid, and volatile.acidity, which is not surprising.

total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated.

citric.acid, volatile.acidity, and fixed.acidity are all strongly correlated with each other.
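The pairwise correlations discussed above can also be computed as numbers rather than read off a plot. The sketch below uses base R's cor on the first ten rows shown in the structure output. The data frame name wine_head is an assumption, and ten rows are far too few to reproduce the full-data correlations; the point is only the mechanics.

```r
# First ten rows of five columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation for every pair of columns.
cor_matrix <- cor(wine_head, method = "pearson")
print(round(cor_matrix, 2))

# List each pair once (upper triangle), strongest absolute correlation first.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
print(pairs_ranked)
```

Ranking by absolute value puts strong negative and strong positive relationships side by side, which makes it easier to name the strongest pair.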

Plot quality against alcohol, volatile.acidity, sulphates, and citric.acid:

Wines with higher alcohol seem to have better quality.

Wines with lower volatile.acidity tend to have better quality.

High-quality wines seem to have slightly higher sulphates.

High-quality wines have higher citric.acid.
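Comparisons like these can be backed with a per-group summary. As a minimal sketch, the example below computes the median alcohol for each quality score using the first ten rows from the structure output; the variable names are assumptions, and the small sample is for illustration only.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol within each quality score.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)

# Number of wines behind each median, to judge how much weight it deserves.
print(table(quality))
```

The same call works for volatile.acidity, sulphates, or citric.acid by swapping the first argument.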

Now let’s check residual sugar against quality:

It is surprising that wine quality has no strong correlation with residual sugar.

Since fixed.acidity, volatile.acidity, and citric.acid are strongly correlated with both pH and quality, it is strange that pH and quality are not strongly correlated. Plot to check it:

High-quality wines seem to have a slightly lower pH, but there are many outliers.

Alcohol and density are negatively correlated.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Wines with higher alcohol seem to have better quality.

Wines with lower volatile.acidity tend to have better quality.

High-quality wines seem to have slightly higher sulphates.

High-quality wines have higher citric.acid.

From the above, higher-quality wines seem to contain more acid, so their pH should be lower. The pH plot suggests this, but there are many outliers.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol has a strong negative correlation with density: the more alcohol in the wine, the lower its density.

Sulphates and chlorides are strongly correlated, and so are sulphates and citric.acid.

pH is strongly correlated with fixed.acidity, citric.acid, and volatile.acidity, which is not surprising.

total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated.

citric.acid, volatile.acidity, and fixed.acidity are all strongly correlated with each other.

What was the strongest relationship you found?

In the correlation plot, density has the strongest negative correlation with fixed.acidity.

Multivariate Plots Section

First I plot alcohol and density by rating.

We can see that bad-rated wines sit in the upper left, and good-rated wines sit in the lower right.

Then we check the acid features against quality.

I can’t get any insight from this plot.

Higher citric.acid combined with lower pH seems to give better-quality wine, but the difference is small.

Lower volatile.acidity combined with lower pH seems to give better-quality wine.

Now check with sulphates:

Higher sulphates go with better quality, and density seems to make no difference.

Higher alcohol combined with higher sulphates goes with better quality.

Higher sulphates go with better quality, and pH seems to make no difference.

Lower volatile.acidity combined with higher sulphates goes with better quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol combined with high sulphates seems to give better-quality wine. Likewise, lower volatile.acidity combined with higher sulphates gives better quality.

Were there any interesting or surprising interactions between features?

Since the acid features affect wine quality, I expected that combining pH with the acid features would reveal something. However, pH combined with the acid features shows no clear effect on quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

People may think residual sugar affects the quality of wine, but this is a misconception. This plot shows no clear relationship between residual sugar and quality.

Plot Two

Description Two

We can see that bad-rated wines sit in the upper left, which means low alcohol and high density, and good-rated wines sit in the lower right, which means high alcohol and low density.

Plot Three

Description Three

We can see that good-quality wines sit in the upper left, which means lower volatile.acidity and higher sulphates, and low-quality wines sit in the lower right, which means higher volatile.acidity and lower sulphates.


Reflection

At first I showed a summary of the data to get a basic understanding of it: for example, how many variables the data has and what each variable’s min, mean, and max are. Then, to learn more about individual variables, I explored them one at a time and plotted histograms. These plots show how each variable is distributed, whether there are many outliers, and whether the collected data is reasonable. For example, it is weird that 8% of the observations have a citric.acid value of 0.

In the “bivariate” section, I tried to show each variable’s influence on ‘quality’. It surprised me that ‘residual.sugar’ has no notable influence on wine quality. And although the acid-related features are highly correlated with wine quality, ‘pH’ is not. Maybe this is because too many factors can affect the pH value.

In the ‘multivariate’ section, I tried to investigate whether combinations of variables affect wine quality. For example, most of the better-quality wines have low volatile.acidity and high sulphates. But this raises a question. The ‘bivariate analysis’ already shows that lower volatile.acidity goes with better quality and that higher sulphates go with better quality. Can we say volatile.acidity and sulphates strengthen each other for quality? I think answering this question properly requires more knowledge of analysis methods.
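One common way to approach this question, offered here as an assumption rather than as part of the original analysis, is to fit a linear model with an interaction term and compare it with the purely additive model. The sketch below shows the mechanics on the first ten rows from the structure output. The name wine_head is hypothetical, and estimates from ten rows carry no real meaning; the full data frame would be needed for any conclusion.

```r
# First ten rows, copied from the str() output above; far too few for inference.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Additive model: each feature contributes independently of the other.
fit_additive <- lm(quality ~ volatile.acidity + sulphates, data = wine_head)

# Interaction model: in a formula, a * b expands to a + b + a:b.
fit_interaction <- lm(quality ~ volatile.acidity * sulphates, data = wine_head)

# The volatile.acidity:sulphates coefficient measures how the effect of one
# feature changes with the level of the other.
print(coef(fit_interaction))

# F-test comparing the two nested models.
print(anova(fit_additive, fit_interaction))
```

If the interaction term adds little over the additive model on the full data, the two features would be better described as contributing separately rather than strengthening each other.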

This wine data contains 12 variables: 11 physicochemical variables and one variable, ‘quality’, which we care about. There are 1599 observations. So this can be viewed as a regression task. To enrich the analysis, one could use linear regression or another machine learning method, feeding in the features to predict the output (quality).
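As a minimal sketch of the regression idea, the example below fits a linear model with base R's lm and predicts the quality of one new wine. It treats quality as a numeric score, uses three features picked for illustration, and runs on the first ten rows from the structure output only. The names wine_head and new_wine, and the feature values of the new wine, are made up for the example.

```r
# First ten rows, copied from the str() output above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with quality treated as a numeric response.
fit <- lm(quality ~ alcohol + volatile.acidity + sulphates, data = wine_head)
print(summary(fit)$coefficients)

# Hypothetical new wine with made-up feature values.
new_wine <- data.frame(alcohol = 10, volatile.acidity = 0.6, sulphates = 0.6)
predicted <- predict(fit, newdata = new_wine)
print(predicted)
```

With the full 1599 rows, holding out part of the data for testing would give a fairer picture of how well such a model predicts quality.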